
Design - LogAnalysis 2

This page is to discuss the next version of the log and event tracing system. The collection system is fairly robust and already works well with a number of different scenarios (syslog, Windows event logs, log4j files), and can easily be extended to handle network-listener-based events or perhaps code-enhancement-based eventing. Where we are currently lacking is in long-term persistent storage and analysis features.

Storage

For storage, we first need to move out of the database and into a distributed, on-disk store with compression and indexing. This screams for a solution like HDFS. Others have already solved this problem using HDFS, and there is even the Chukwa project, which is designed as a self-contained solution. Chukwa is interesting because it is already in use and has tackled many of the necessary topics. One downside may be the difficulty of integrating it with the RHQ infrastructure so as to gain the most value from a cohesive management strategy that ties other types of data together. E.g., the RHQ agent already manages itself, upgrades itself, has reliable, secure delivery, and collects log and event data in a way that attaches it to our deep inventory system.

Analysis and Reporting

We want to be able to perform both large-scale searches of the data and correlated analysis of specific message rates, or transformation into quantitative data sets. Externalizing the data set would remove it one step from our inventory system, which may make some of the analysis more difficult (e.g., "search the logs for this cluster", where a cluster is an RHQ inventory concept). It would also need considerable attention to alerting integration for best results.

Requirements

Technology

Data stores on the table (so far) would probably be Cassandra, Voldemort, and HBase... preferably with Lucene integration. Lucandra may be a good option for us. We were also pointed at ElasticSearch and should take a look at that too.

Strategies

For computing event recurrence we can do simple grouping by indexing the entire message (with, at a minimum, numbers removed). If we want fuzzier grouping of messages that are almost the same, we could use map-reduce to calculate Levenshtein distances to judge message similarity. For example, removing numbers lets us group "Compression took [823ms]" and "Compression took [2423ms]", but "User [foo] changed password" and "User [bar] changed password" would not be grouped that way.
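
As a rough sketch of the two grouping strategies above (the class and method names are hypothetical, not existing RHQ code), the number-stripping normalization and a standard Levenshtein distance might look like this in Java:

```java
/**
 * Sketch of the two grouping approaches described above: exact grouping on a
 * number-stripped message key, and fuzzy grouping via Levenshtein edit distance.
 * Class and method names here are hypothetical, not part of RHQ.
 */
public class MessageGrouping {

    /** Strip digits so "Compression took [823ms]" and "Compression took [2423ms]" share a key. */
    public static String normalize(String message) {
        return message.replaceAll("\\d+", "#").trim();
    }

    /** Classic dynamic-programming Levenshtein distance between two strings. */
    public static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        // Same normalized key, so these group together on an exact match:
        System.out.println(normalize("Compression took [823ms]"));  // Compression took [#ms]
        System.out.println(normalize("Compression took [2423ms]")); // Compression took [#ms]

        // Different keys even after normalization, but a small edit distance
        // (3 here) could let a fuzzy pass group them anyway:
        System.out.println(levenshtein(normalize("User [foo] changed password"),
                                       normalize("User [bar] changed password")));
    }
}
```

In a map-reduce pass, the normalized key would be the natural grouping key; the Levenshtein computation would only be needed as a second pass to merge near-identical keys.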

Once messages are grouped, messages that have interesting numeric trend data could be separated out for time series plots (e.g., plotting the compression time logged above).
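
Continuing the sketch (again with hypothetical names, not existing RHQ code), a small extractor could pull the varying numeric field back out of each message in a group and pair it with the event timestamp, which is all a time series plot needs:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Hypothetical sketch: given messages grouped under the same normalized key,
 * pull the varying numeric field back out of each raw message so it can be
 * plotted as a (timestamp, value) time series.
 */
public class NumericTrendExtractor {

    private static final Pattern FIRST_NUMBER = Pattern.compile("(\\d+)");

    /** One derived data point: when the event was logged and the number it carried. */
    public static final class Point {
        final long timestamp;
        final double value;

        Point(long timestamp, double value) {
            this.timestamp = timestamp;
            this.value = value;
        }

        @Override
        public String toString() {
            return timestamp + " -> " + value;
        }
    }

    /** Extract the first numeric token of each message; skip messages without one. */
    public static List<Point> toSeries(long[] timestamps, String[] messages) {
        List<Point> series = new ArrayList<Point>();
        for (int i = 0; i < messages.length; i++) {
            Matcher m = FIRST_NUMBER.matcher(messages[i]);
            if (m.find()) {
                series.add(new Point(timestamps[i], Double.parseDouble(m.group(1))));
            }
        }
        return series;
    }

    public static void main(String[] args) {
        long[] times = { 1000L, 2000L };
        String[] msgs = { "Compression took [823ms]", "Compression took [2423ms]" };
        for (Point p : toSeries(times, msgs)) {
            System.out.println(p); // 1000 -> 823.0, then 2000 -> 2423.0
        }
    }
}
```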

Cassandra

A bit of analysis on the use of Cassandra for this.

Good

  • Decentralized (no central node to be a bottleneck or fail)

  • Linear horizontal scaling of writes

  • Datacenter (geographic) partitioning support with pluggable semantics so we can align its storage with our affinity model

  • Tunable consistency semantics give flexibility in the AP vs. CP trade-off

  • Lucandra is, at a minimum, an example of Lucene integration with Cassandra, and maybe something we can use and contribute to

  • The column-family model (I believe) gives us the best flexibility for splitting log data and optimizing retrieval (see the sketch after this list)

  • Map-reduce and Pig integration

  • Range queries

  • Should be easy to embed in our Server VM
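
To make the column-family point above concrete, here is a purely conceptual sketch that approximates Cassandra's row-key/column layout with plain Java sorted maps. The key scheme (one row per resource per day, columns keyed by event timestamp) and all names are assumptions for illustration, not real Cassandra or RHQ APIs:

```java
import java.util.SortedMap;
import java.util.TreeMap;

/**
 * Illustration only: Cassandra's column-family model approximated with plain
 * Java sorted maps -- row key -> (column name -> value). A row here is one
 * resource per day, and columns are keyed by event timestamp, so a time-range
 * read for a single resource is a contiguous column slice. All names and the
 * key scheme are assumptions for this sketch, not real Cassandra or RHQ APIs.
 */
public class LogColumnFamilyModel {

    private static final long MILLIS_PER_DAY = 24L * 60L * 60L * 1000L;

    /** row key (e.g. "resourceId:day") -> time-ordered columns (timestamp -> message) */
    private final SortedMap<String, SortedMap<Long, String>> eventsByRow =
        new TreeMap<String, SortedMap<Long, String>>();

    /** Write one log event into the row for its resource and day. */
    public void insert(int resourceId, long timestampMillis, String message) {
        String rowKey = resourceId + ":" + (timestampMillis / MILLIS_PER_DAY);
        SortedMap<Long, String> columns = eventsByRow.get(rowKey);
        if (columns == null) {
            columns = new TreeMap<Long, String>();
            eventsByRow.put(rowKey, columns);
        }
        columns.put(timestampMillis, message);
    }

    /** Range query: events for one resource on one day, between two timestamps. */
    public SortedMap<Long, String> slice(int resourceId, long day, long fromMillis, long toMillis) {
        SortedMap<Long, String> columns = eventsByRow.get(resourceId + ":" + day);
        if (columns == null) {
            return new TreeMap<Long, String>();
        }
        return columns.subMap(fromMillis, toMillis);
    }
}
```

Bucketing rows by resource and day would keep any single row bounded while still letting a time-range read for one resource be a contiguous, ordered column slice, which is what makes the range-query support above useful for log retrieval.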

Bad

  • Cassandra does not yet have compression (though it is planned for 0.8, I believe)

  • The Thrift-based API does not support streaming

Functional Designs

The following two diagrams show some ideas for displaying analyzed frequencies for important events.

[Image: EventDesign_1.png]
[Image: EventDesign_2.png]
